Introduction

This report describes an analysis of raw counts of Nanostring GeoMx data provided by Diana Piol, and analyzed using the DESeq2 Bioconductor package. This report analyzes the inferred transcriptional changes between experimental groups and produces statistical diagnostic plots and lists of differentially expressed (DE) genes for each experimental comparison, and saves these resulting DE gene lists as Excel files.

## load libraries
library("dplyr")
library("tidyr")
library("tibble")
library("ggplot2")
library("ggrepel")
library("DESeq2")
library("RColorBrewer")
library("pheatmap")
library("readxl")
library("readr")
library("writexl")
library("janitor")
library("gridExtra")
library("biomaRt")
library("DT")

## set project variables
proj <- "NanoString"
sel_comp <- "CTRL"
tissue <- "spinal_cord"
sel_comp1 <- "Chatpos_vs_Chatneg"

Experimental design

We load the complete Nanostring count matrix of all samples and metadata describing the conditions of these samples. The metadata and matrix counts from NanoString data are subset to contain only CTRL spinal cord samples. The metadata (or “colData”) for the samples being compared is displayed in an HTML table shown below:

The following conditions being compared within CTRL samples:

  • Chatpos vs Chatneg (across only CTRL)

Note: all comparisons are based relative to the first comparison listed, which is used as the normative control

Differential expression analysis

For this project, we are using the following model: ~ segment

The standard differential expression analysis steps are wrapped into a single function, DESeq. The estimation steps performed by this function are described in the DESeq2 vignette, and in the manual page for ?DESeq and in the Methods section of the DESeq2 publication (Love, Huber, and Anders 2014).

dds <- DESeqDataSetFromMatrix(countData = cts %>% as.matrix(), 
                              colData = meta_data,
                              design = ~ segment)
cat("Number of genes in dds object *before* filtering out rows with zero counts \n")
#> Number of genes in dds object *before* filtering out rows with zero counts
dim(assay(dds))[1]
#> [1] 19959

For QC reasons, genes with counts of all 0 for all samples (i.e. rows in the count matrix with no count measurements) are removed before running DESeq. This is because for rows with all zero counts no variance can be modeled.

keep <- rowSums(counts(dds)) > 0
dds <- DESeq(dds[keep, ])
cat("Number of genes in dds object *after* filtering out rows with zero counts \n")
#> Number of genes in dds object *after* filtering out rows with zero counts
dim(assay(dds))[1]
#> [1] 19958

Results tables are generated using the function results, which extracts a results table with log2 fold changes, p-values and adjusted p-values. With no additional arguments to results, the log2 fold change and Wald test p-value will be for the first variable in the design formula, the experiment group will be the last variable.

res1 <- results(dds, contrast = c("segment", "ChATpos", "ChATneg"), alpha = 0.05)

Data exploration and quality assessment (QC)

Principal components analysis (PCA)

PCA is a method of visually identifying the similarity or difference between samples. PCA rotates the data cloud onto an orthogonal basis determined by the dimensions of maximal variance. The first two Principal Components (PCs) usually hold the majority of the variance of the data. The following plots show the variance stabilized transformed count matrix samples projected onto the two largest Principal Components (i.e. PC1 and PC2). DESeq2 recommends two types of PCA stabilizations that are performed prior to creating the PCA: * vst “variance stabilizing transformation” * rld “regularized log transformation”

## perform variance stabilizing transformation and PCA and plot with ggplot2
vst <- DESeq2::vst(dds)
pcaData <- plotPCA(vst, intgroup = c("segment"), returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
pcaData %>% 
  ggplot(aes(PC1, PC2, color = segment)) +
  geom_point(size = 3) +
  geom_text_repel(aes(label = colnames(vst)), force = 5,
                  arrow = arrow(length = unit(0.03, "npc"),
                                type = "closed", ends = "first")) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  theme(axis.ticks.x = element_blank(), axis.text.x = element_blank(),
        axis.ticks.y = element_blank(), axis.text.y = element_blank()) +
  coord_fixed()

## perform regularized log transformation and PCA and plot with ggplot2
rld <- rlogTransformation(dds)
pcaData <- plotPCA(rld, intgroup = c("segment"), returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
pcaData %>% 
  ggplot(aes(PC1, PC2, color = segment)) +
  geom_point(size = 3) +
  geom_text_repel(aes(label = colnames(rld)), force = 5,
                  arrow = arrow(length = unit(0.03, "npc"),
                                type = "closed", ends = "first")) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  theme(axis.ticks.x = element_blank(), axis.text.x = element_blank(),
        axis.ticks.y = element_blank(), axis.text.y = element_blank()) +
  coord_fixed()

Size factors

Size factors are a method of normalizing used by the DESeq function to normalize the data in terms of sequencing depth. Size factor is the median ratio of the sample over a pseudosample: for each gene, the geometric mean of all samples. Size factors account for differences in sequencing depth are typically centered around 1 (indicating comparable sequencing depth).

dds$sizeFactor %>%
  as.data.frame() %>% 
  tibble::rownames_to_column(var = "sample") %>% 
  dplyr::rename(size_factors = names(.)[2]) %>% 
  dplyr::left_join(meta_data, by = "sample") %>% 
  ggplot(aes(x = sample, y = size_factors, fill = segment)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  geom_hline(yintercept = mean(dds$sizeFactor), color = "red", linetype = "dashed")

Dispersion plot

DESeq2 estimates gene dispersion using an algorithm that first generates gene-wise maximum likelihood estimates (MLEs) that are obtained using only the respective gene’s data (black dots). Then, a curve (red) is fit to the MLEs to capture the overall trend of dispersion-mean dependence. This fit is used as a prior mean for a second estimation round, which results in the final maximum a priori (MAP) estimates of dispersion. This results in a “shrinkage” of the noisy gene-wise estimates toward the consensus represented by the red line. The black points circled in blue are detected as dispersion outliers and not shrunk toward the prior (shrinkage would follow the dotted line). A more in-depth theoretical explanation of the DESeq2 algorithm can be found here

plotDispEsts(dds, main = "DESeq2 Dispersion plot")

Quality assessment (QC)

Independent Filtering

DESeq2 performs independent filtering by default using the mean of normalized counts as a filter statistic. A threshold on the filter statistic (first value) is found which optimizes the number of adjusted p-values lower than significance level alpha. The adjusted p-values for the genes which do not pass the filter threshold are set to NA. The results also returns the mean of normalized counts (second value).

cat("Filter thresh. val. and mean of norm. counts", gsub("_", " ", sel_comp1), " \n")
#> Filter thresh. val. and mean of norm. counts Chatpos vs Chatneg
metadata(res1)$filterThreshold
#> 62.04082% 
#>  5.105038

Plot of sample rejections vs filter quantiles

The filterThreshold returns the threshold chosen (vertical line in the plots below) by the DESeq2 analysis of the lowest quantile of the filter for which the number of sample rejections is within 1 residual standard deviation to the peak of a curve fit to the number of rejections over the filter quantiles. The following diagnostic plot shows the number of rejected samples (y-axis) plotted against quantiles of filter (x-axis).

par(mfrow = c(1, 1))
plot(metadata(res1)$filterNumRej, type = "b", main = gsub("_", " ", sel_comp1),
     xlab = "Quantiles of filter", ylab = "Number of rejections")
lines(metadata(res1)$lo.fit, col = "red")
abline(v = metadata(res1)$filterTheta)

Histogram of frequency of p-values of results

The following plot shows the number of frequency of counts (y-axis) against p-values between 0 and 1 (x-axis).

par(mfrow = c(1, 1))
hist(res1$pvalue, col = "lavender", xlab = "p-values", main = gsub("_", " ", sel_comp1))

Results

print(gsub("_", " ", sel_comp1))
#> [1] "Chatpos vs Chatneg"
print(summary(res1))
#> 
#> out of 19958 with nonzero total read count
#> adjusted p-value < 0.05
#> LFC > 0 (up)       : 428, 2.1%
#> LFC < 0 (down)     : 243, 1.2%
#> outliers [1]       : 3, 0.015%
#> low counts [2]     : 12382, 62%
#> (mean count < 5)
#> [1] see 'cooksCutoff' argument of ?results
#> [2] see 'independentFiltering' argument of ?results
#> 
#> NULL

Results Annotation

We annotate the DE results with mouse Ensembl and Entrez IDs by accessing the BioMart database.

## annotate DE results with Biomart database
# listAttributes(mart = useDataset("mmusculus_gene_ensembl", useMart("ensembl")))
# getBM(attributes = c("ensembl_gene_id", "entrezgene_id", "external_gene_name"),
#       mart = useDataset("mmusculus_gene_ensembl", useMart("ensembl"))) %>%
#   dplyr::rename(ensembl_id = ensembl_gene_id,
#                 entrez_id = entrezgene_id,
#                 genes = external_gene_name) %>%
#   dplyr::filter(stringr::str_length(genes) > 1,
#                 !duplicated(ensembl_id)) -> mouse_biomart
# saveRDS(object = mouse_biomart, file = "./data/mouse_biomart.rds")
mouse_biomart <- readRDS(file = "../202311_piol/data/mouse_biomart.rds")

The annotated DE results are arranged by p-adjusted value. Normalized counts from the are added to the right of the DE statistics.

## this code chunk extracts the Cook's Distance
as.data.frame(mcols(dds)$maxCooks) %>%
  dplyr::rename(cooks_stat = names(.)[1]) %>%
  dplyr::arrange(desc(cooks_stat)) %>% 
  tibble::rownames_to_column(var = "genes") -> top_cooks

merge(as.data.frame(res1), as.data.frame(counts(dds, normalized = TRUE)),
      by = "row.names", sort = FALSE) %>%
  dplyr::rename(genes = names(.)[1]) %>%
  dplyr::left_join(mouse_biomart, by = "genes") %>%
  dplyr::left_join(top_cooks, by = "genes") %>%
  dplyr::arrange(padj) %>%
  dplyr::select(genes, contains("_id"), cooks_stat, everything()) %>%
  as.data.frame() -> res_df1

MA plots

A MA plot illustrates log-fold expression change between two groups of samples, created by transforming and the data onto two scales: M (the log of the ratio of level counts for each gene between two samples) and A (the average level counts for each gene across the two samples) scales. MA plots demonstrates the difference between samples in terms of signal intensities of read counts. In this type of plot, genes with similar expression levels in two samples will appear around the horizontal line y = 0 (red line). The following MA plot illustrates log-fold expression change for each comparison after the DESeq2 analysis. Significant genes (P < 0.05) are highlighted in red.

results1 <- as.data.frame(res_df1)
results1[is.na(results1)] <- 0.99 # change NA results to 0.99 for correct MAplot
results1 %>%
  ggplot(aes(x = baseMean, y = log2FoldChange)) +
  geom_point(aes(colour = padj < 0.05), size = 0.5) +
  scale_colour_manual(name = 'padj < 0.05',
                      values = setNames(c('red','black'), c(TRUE, FALSE))) +
  scale_x_continuous(trans = "log10", limits = c(0.1, 300000)) +
  geom_smooth(colour = "red") +
  geom_abline(slope = 0, intercept = 0, colour = "blue") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("baseMean (A)") + ylab("log2FoldChange (M)") +
  ggtitle(paste0("MA plot for ", gsub("_", " ", sel_comp1)))

Volcano plots

A volcano plot is a type of scatter plot that is used to quickly identify genes that display large magnitude changes that are also statistically significant. A volcano plot is constructed by plotting the negative log of the p-value on the y-axis. This results in data points with low p-values appearing toward the top of the plot. The x-axis is the log of the fold change between the two experimental conditions. The log of the fold change is used so that changes in both directions appear equidistant from the center. Plotting points in this way results in two regions of interest in the plot: those points that are found toward the top of the plot that are far to either the left- or right-hand sides. These represent values that display large magnitude fold changes (on the left or right of center) as well as high statistical significance (toward the top). The following Volcano plot shows log of the fold change and negative log of the p-values for each comparison. Significant genes (P < 0.05) with log2 fold change (> 1) are highlighted in red.

res_df1 %>%
  dplyr::filter(!is.na(pvalue)) %>%
  dplyr::mutate(threshold = as.factor(abs(log2FoldChange) > 1 & pvalue < 0.05),
                sig_group = as.factor(dplyr::case_when(
                  log2FoldChange > 1 & pvalue < 0.05 ~ "Control",
                  log2FoldChange < -1 & pvalue < 0.05 ~ "FUS",
                  TRUE ~ "not significant"))) %>%
  ggplot(aes(x = log2FoldChange, y = -log10(pvalue), color = sig_group)) +
  geom_point(alpha = 0.75, size = 0.75) +
  geom_hline(yintercept = -log10(0.05), color = "red", linetype = "dashed") +
  geom_vline(xintercept = -1, color = "red", linetype = "dashed") +
  geom_vline(xintercept = 1, color = "red", linetype = "dashed") +
  xlab("log2 fold change") +
  ylab("-log10 p-value") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_manual(values = c("limegreen", "magenta", "gray3")) +
  ggtitle(paste0("Volcano plot for ", gsub("_", " ", sel_comp1)))

Count plots

Count plots are created for the top 5 DE genes for each comparison.

res_df1 %>%
  dplyr::slice(1:5) %>%
  dplyr::select(genes, tidyr::contains("DSP")) %>%
  tidyr::gather(key = "sample", value = "count", -genes) %>% 
  dplyr::left_join(meta_data, by = "sample") %>%
  ggplot(aes(x = sample, y = log10(count), color = segment)) +
  geom_point() +
  theme(axis.text.x = element_blank()) +
  ggtitle(paste0("Gene counts of top 5 genes for ", gsub("_", " ", sel_comp1),
                 " ", tissue)) +
  facet_grid(~genes)

Results tables

Below are three html tables displaying the results the DE comparisons. These tables are interactive and be queried for specific proteins or sorted by column.

Note on p-values set to NA: some values in the results table can be set to NA for one of the following reasons: * If within a row, all samples have zero counts, the baseMean column will be zero, and the log2 fold change estimates, p-value and adjusted p-value will all be set to NA. * If a row contains a sample with an extreme count outlier then the p-value and adjusted p-value will be set to NA. These outlier counts are detected by Cook’s distance. * If a row is filtered by automatic independent filtering, for having a low mean normalized count, then only the adjusted p-value will be set to NA.

Results table Chatpos vs Chatneg

res_df1 %>%
  dplyr::select(-c(contains("_id"), cooks_stat)) %>%
  dplyr::mutate_at(vars(baseMean:stat), round, 3) %>%
  dplyr::mutate_at(vars(matches("DSP")), round, 3) %>%
  DT::datatable()

Save Results

Results are saved as .csv and .rds files which are unfiltered.

These Excel files have the following columns:

  • genes = Hugo gene symbol (official gene name)
  • entrez_id = Entrez gene ID
  • ensembl_id = Ensembl gene ID
  • non_zero_de = number of non-zero values across gene rows
  • perc_non_zero_de = percent non-zero values across gene rows out of all samples
  • baseMean = DESeq2 average mean count per gene
  • log2FoldChange = log2 normalized fold change between two DE conditions
  • lfcSE = standard Normal distribution to generate a two-tailed p-value
  • stat = the difference in deviance between the reduced model and the full model, which is compared to a chi-squared distribution to generate a pvalue
  • pvalue = p-value
  • padj = Benjamini-Hochberg FDR p-adjusted value

Note: Cook’s Distance could not be calculated because we have genotype groups of less than 4 samples.

## populdate list of DE results
de_list <- list(res_df1)

## name list headers to become sheet names
names(de_list) <- c(paste0(gsub("_", " ", sel_comp1)))

## save as Excel and RDS
saveRDS(object = de_list, file = paste0(
  "./data/", proj, "_", tissue, "_", sel_comp, "_DE_res_", Sys.Date(), ".rds"))
writexl::write_xlsx(x = de_list, path =  paste0(
  "./data/", proj, "_", tissue, "_", sel_comp, "_DE_res_", Sys.Date(), ".xlsx"))

Session Info

sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sonoma 14.5
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: Europe/Brussels
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] DT_0.33                     biomaRt_2.60.0             
#>  [3] gridExtra_2.3               janitor_2.2.0              
#>  [5] writexl_1.5.0               readr_2.1.5                
#>  [7] readxl_1.4.3                pheatmap_1.0.12            
#>  [9] RColorBrewer_1.1-3          DESeq2_1.44.0              
#> [11] SummarizedExperiment_1.34.0 Biobase_2.64.0             
#> [13] MatrixGenerics_1.16.0       matrixStats_1.3.0          
#> [15] GenomicRanges_1.56.0        GenomeInfoDb_1.40.0        
#> [17] IRanges_2.38.0              S4Vectors_0.42.0           
#> [19] BiocGenerics_0.50.0         ggrepel_0.9.5              
#> [21] ggplot2_3.5.1               tibble_3.2.1               
#> [23] tidyr_1.3.1                 dplyr_1.1.4                
#> 
#> loaded via a namespace (and not attached):
#>  [1] DBI_1.2.2               httr2_1.0.1             rlang_1.1.3            
#>  [4] magrittr_2.0.3          snakecase_0.11.1        compiler_4.4.0         
#>  [7] RSQLite_2.3.7           mgcv_1.9-1              png_0.1-8              
#> [10] vctrs_0.6.5             stringr_1.5.1           pkgconfig_2.0.3        
#> [13] crayon_1.5.2            fastmap_1.2.0           dbplyr_2.5.0           
#> [16] XVector_0.44.0          labeling_0.4.3          utf8_1.2.4             
#> [19] rmarkdown_2.27          tzdb_0.4.0              UCSC.utils_1.0.0       
#> [22] purrr_1.0.2             bit_4.0.5               xfun_0.44              
#> [25] zlibbioc_1.50.0         cachem_1.1.0            jsonlite_1.8.8         
#> [28] progress_1.2.3          blob_1.2.4              highr_0.11             
#> [31] DelayedArray_0.30.0     BiocParallel_1.38.0     parallel_4.4.0         
#> [34] prettyunits_1.2.0       R6_2.5.1                bslib_0.7.0            
#> [37] stringi_1.8.4           lubridate_1.9.3         jquerylib_0.1.4        
#> [40] cellranger_1.1.0        Rcpp_1.0.12             knitr_1.46             
#> [43] splines_4.4.0           Matrix_1.7-0            timechange_0.3.0       
#> [46] tidyselect_1.2.1        rstudioapi_0.16.0       abind_1.4-5            
#> [49] yaml_2.3.8              codetools_0.2-20        curl_5.2.1             
#> [52] lattice_0.22-6          withr_3.0.0             KEGGREST_1.44.0        
#> [55] evaluate_0.23           BiocFileCache_2.12.0    xml2_1.3.6             
#> [58] Biostrings_2.72.0       filelock_1.0.3          pillar_1.9.0           
#> [61] generics_0.1.3          hms_1.1.3               munsell_0.5.1          
#> [64] scales_1.3.0            glue_1.7.0              tools_4.4.0            
#> [67] locfit_1.5-9.9          grid_4.4.0              crosstalk_1.2.1        
#> [70] AnnotationDbi_1.66.0    colorspace_2.1-0        nlme_3.1-164           
#> [73] GenomeInfoDbData_1.2.12 cli_3.6.2               rappdirs_0.3.3         
#> [76] fansi_1.0.6             S4Arrays_1.4.0          gtable_0.3.5           
#> [79] sass_0.4.9              digest_0.6.35           SparseArray_1.4.1      
#> [82] farver_2.1.2            htmlwidgets_1.6.4       memoise_2.0.1          
#> [85] htmltools_0.5.8.1       lifecycle_1.0.4         httr_1.4.7             
#> [88] bit64_4.0.5

References:

Love, M.I., Huber, W., Anders, S. (2014) “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15:550. 10.1186/s13059-014-0550-8

Anders, Simon, and Wolfgang Huber. 2010. “Differential Expression Analysis for Sequence Count Data.” Genome Biology 11:R106. http://genomebiology.com/2010/11/10/R106.

Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Research, 43(7), e47. doi: 10.1093/nar/gkv007.